feat: Add a parquet uuid calculation #3440

Open
wants to merge 6 commits into
base: main

NJManganelli

Calculate a uuid from parquet metadata, using detailed information from the first and last row_groups plus the col_counts of all row_groups of the file or dataset. As far as I know, parquet stores a checksum at the column-page level, but an approximate calculation that deterministically uses two row groups and catches differences in the number of rows, row groups, columns, compression, etc. should be sufficient. That is the equivalent of what coffea does with ROOT files: it flags changed files so that the form, steps, etc. are recalculated.

https://github.com/scikit-hep/coffea/blob/master/src/coffea/dataset_tools/preprocess.py#L46-L48

Also, the ParquetMetadata namedtuple doesn't appear to be used, at least in the file touched here. An extra line is needed to keep the length of the returned tuple unchanged, to try to avoid breaking compatibility for outside users. Maybe the plain tuple should be deprecated and the namedtuple used instead?
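
The calculation described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function name `parquet_fingerprint`, the plain-dict row-group summaries, and the choice of SHA-256 with JSON serialization are all assumptions (the 64-character hex digests in the test output further down are merely consistent with a 256-bit hash).

```python
import hashlib
import json


def parquet_fingerprint(row_groups, col_counts):
    """Hash the first and last row-group summaries plus all col_counts.

    Hypothetical inputs: `row_groups` is a list of plain dicts, one summary
    per row group; `col_counts` is a list of ints, one per row group.
    """
    # Only two row groups are inspected in detail, so the cost does not
    # grow with the number of row groups (apart from col_counts).
    selected = [row_groups[0], row_groups[-1]] if row_groups else []
    payload = {"row_groups": selected, "col_counts": list(col_counts)}
    # sort_keys and fixed separators keep the serialization deterministic.
    encoded = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(encoded.encode("utf-8")).hexdigest()


# Illustrative values only; the field names are made up for this sketch.
groups = [{"num_rows": 5, "num_columns": 7, "compression": "SNAPPY"}]
digest = parquet_fingerprint(groups, [5])
changed = parquet_fingerprint([{**groups[0], "compression": "ZSTD"}], [5])
print(digest != changed)
```

A change in compression, row count, or column count in either inspected row group alters the digest, while changes confined to the middle row groups are only caught if they affect `col_counts`. That is the trade-off of the approximate approach.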

@NJManganelli NJManganelli changed the title Add a parquet uuid calculation feat: Add a parquet uuid calculation Mar 31, 2025

codecov bot commented Mar 31, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 82.32%. Comparing base (b749e49) to head (eab037d).
Report is 313 commits behind head on main.

Additional details and impacted files
Files with missing lines Coverage Δ
src/awkward/operations/ak_from_parquet.py 93.05% <100.00%> (+2.01%) ⬆️
src/awkward/operations/ak_metadata_from_parquet.py 100.00% <100.00%> (ø)

... and 189 files with indirect coverage changes

@ianna
Collaborator

ianna commented Apr 17, 2025

@NJManganelli - what is the status of this PR? Are you still working on it? Thanks!

@NJManganelli
Author

Hi @ianna, I'll add a test; then I think it'll be ready from my side.

@NJManganelli
Author

This does not come without a performance penalty, but if it needs to be optimized, we could figure out a smarter but still sufficient calculation (I'd like to ensure that any change in compression, columns, or rows is captured). It's also possible that I didn't sufficiently explore the checksum information that's supposed to be available, but I think those checksums are at the page level or similar, and just looping over all of them seems like it would be much slower than this.

Without uuid:

python -m timeit -n 1000 "import test_3440_calculate_parquet_uuid; test_3440_calculate_parquet_uuid.test_parquet_uuid()"
1000 loops, best of 5: 83 usec per loop

With uuid:

tests git:(parquet_uuid) ✗ python -m timeit -n 1000 "import test_3440_calculate_parquet_uuid; test_3440_calculate_parquet_uuid.test_parquet_uuid()"
1000 loops, best of 5: 209 usec per loop

Nick Manganelli added 3 commits April 18, 2025 09:54
…e first and last row_groups plus the col_counts of all row_groups of the file or dataset
@NJManganelli
Author

Rebased for my own sanity and marked ready, presuming all the tests pass; I'll fix it if they don't.

@NJManganelli NJManganelli marked this pull request as ready for review April 18, 2025 14:55
Collaborator

@ianna ianna left a comment

@NJManganelli - it looks like uuids do not match:

______________________________ test_parquet_uuid _______________________________

    def test_parquet_uuid():
        meta = metadata_from_parquet(input)
>       assert (
            meta["uuid"]
            == "93fd596534251b1ac750f359d9d55418b02bcddcc79cd5ab1e3d9735941fbcae"
        )
E       AssertionError: assert 'adc236484e10384ad680a1ae2ce1fc675f6f9f0a84a02c651ed91ad97fba41c0' == '93fd596534251b1ac750f359d9d55418b02bcddcc79cd5ab1e3d9735941fbcae'
E         
E         - 93fd596534251b1ac750f359d9d55418b02bcddcc79cd5ab1e3d9735941fbcae
E         + adc236484e10384ad680a1ae2ce1fc675f6f9f0a84a02c651ed91ad97fba41c0

meta       = {'col_counts': [5],
 'columns': ['u1', 'u4', 'u8', 'f4', 'f8', 'raw', 'utf8'],
 'form': RecordForm([BitMaskedForm('u8', NumpyForm('bool'), True, True), BitMaskedForm('u8', NumpyForm('int32'), True, True), BitMaskedForm('u8', NumpyForm('int64'), True, True), BitMaskedForm('u8', NumpyForm('float32'), True, True), BitMaskedForm('u8', NumpyForm('float64'), True, True), BitMaskedForm('u8', ListOffsetForm('i32', NumpyForm('uint8', parameters={'__array__': 'byte'}), parameters={'__array__': 'bytestring'}), True, True), BitMaskedForm('u8', ListOffsetForm('i32', NumpyForm('uint8', parameters={'__array__': 'char'}), parameters={'__array__': 'string'}), True, True)], ['u1', 'u4', 'u8', 'f4', 'f8', 'raw', 'utf8']),
 'fs': <fsspec.implementations.local.LocalFileSystem object at 0x7ff63efc9340>,
 'num_row_groups': 1,
 'num_rows': 5,
 'paths': ['/home/runner/work/awkward/awkward/tests/samples/nullable-record-primitives.parquet'],
 'uuid': 'adc236484e10384ad680a1ae2ce1fc675f6f9f0a84a02c651ed91ad97fba41c0'}

tests/test_3440_calculate_parquet_uuid.py:22: AssertionError

@NJManganelli
Copy link
Author

Aye, from that printout it looks like this will need to be more selective about what goes into the hash. I'll have a look when I'm back from holidays.

@NJManganelli
Copy link
Author

@ianna I'm trying a more selective set of key-value pairs, hoping it'll be more stable. However, "it works on my machine" just as the previous version did, so I think we need to see what the tests say.
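
One way to read "a more selective set of key-value pairs" is an allow-list of fields that describe the file's content rather than where it is opened. The sketch below is an illustration under that assumption: the thread does not identify which fields caused the mismatch, and `STABLE_KEYS` and `stable_digest` are hypothetical names, not part of awkward. The dictionary values are taken from the `meta` printout in the review above; the second path is made up.

```python
import hashlib
import json

# Hypothetical allow-list of content-describing fields.
STABLE_KEYS = ("num_rows", "num_row_groups", "columns", "col_counts")


def stable_digest(meta, keys=STABLE_KEYS):
    """Hash only the allow-listed entries of a metadata dictionary."""
    selected = {k: meta[k] for k in keys if k in meta}
    encoded = json.dumps(selected, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(encoded.encode("utf-8")).hexdigest()


ci_meta = {
    "col_counts": [5],
    "columns": ["u1", "u4", "u8", "f4", "f8", "raw", "utf8"],
    "num_row_groups": 1,
    "num_rows": 5,
    "paths": ["/home/runner/work/awkward/awkward/tests/samples/nullable-record-primitives.parquet"],
}
# Same file content, opened from a different (made-up) location.
local_meta = {**ci_meta, "paths": ["/tmp/nullable-record-primitives.parquet"]}

print(stable_digest(ci_meta) == stable_digest(local_meta))
```

With an allow-list, environment-specific entries such as absolute paths or filesystem objects never reach the hash, so the same file gives the same digest on a developer machine and on CI.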

@NJManganelli NJManganelli requested a review from ianna April 24, 2025 20:46